Export generalizations #1387
base: master
Conversation
@CBroz1 I'm leaving this as a draft while I try it out, but would appreciate it if you flag any issues in the approach when you have time.
Codecov Report

@@            Coverage Diff             @@
##           master    #1387      +/-  ##
==========================================
+ Coverage   69.79%   70.09%    +0.30%
==========================================
  Files         104      105        +1
  Lines       12728    12917      +189
==========================================
+ Hits         8883     9054      +171
- Misses       3845     3863       +18
CBroz1 left a comment:

Thanks for putting this together! Seems on track; I just had some questions about subqueries and chunking.
    if restriction
    else self
If you have neither a passed restriction nor a self.restriction attr, we might save time by setting the restriction to True and returning early
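A minimal standalone sketch of that early return (`resolve_restriction` is a hypothetical helper name, not the PR's actual method):

```python
def resolve_restriction(passed_restr=None, self_restr=None):
    """Return the effective restriction for logging.

    If neither a passed restriction nor a table-level restriction exists,
    short-circuit to True (the trivially-true condition) so the caller can
    return early instead of building a restricted copy of the table.
    """
    if not passed_restr and not self_restr:
        return True  # early return: nothing restricts this fetch
    return passed_restr if passed_restr else self_restr
```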
src/spyglass/utils/mixins/export.py (outdated):

    if not (
        (
            isinstance(restr_str, str)
            and (len(restr_str) > 2048)
It's still possible we'll bump this 2048 in the future. If so, it would be helpful to have a comment under ExportSelection.Table saying this was hard-coded here
It's also possible to fetch the value with self._export_table.Table.heading.attributes['restriction'].type[8:-1], but I would want to save that as a cached_property like self._export_restr_limit in that case to prevent that kind of table definition fetch on each iteration.
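A sketch of that cached_property idea; the real lookup would go through `self._export_table.Table.heading.attributes["restriction"].type`, with a literal type string standing in here for illustration:

```python
from functools import cached_property


class ExportLimitSketch:
    """Cache the varchar limit parsed from the column's type string."""

    # placeholder for the heading lookup on the real export table
    _restriction_type = "varchar(2048)"

    @cached_property
    def _export_restr_limit(self) -> int:
        # type[8:-1] slices the digits out of "varchar(2048)"
        return int(self._restriction_type[8:-1])
```

Because `cached_property` stores the result on the instance, the table-definition fetch would only happen once per object rather than on each iteration.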
I'll also add a raised error in _insert_log if too long a restriction makes it there.
src/spyglass/utils/mixins/export.py (outdated):

        if restriction
        else self
    )
    restricted_entries = restricted_table.fetch("KEY", log_export=False)
Just to state the logic: if a restriction exceeds our character limit, we fetch the restricted table and recompile the restriction from chunks of entries. Yeah?
I wanted to confirm that the resulting entries were treated as cumulative, so I dug through the export logic -> RestrGraph logic -> _set_restr method ... and yes, _set_restr merges passed entries by default.
Yeah. The new entries should be equivalent to those made by accessing the chunks of file entries sequentially. It loses information about what the restriction in the code was, but the entries specified should be the same.
src/spyglass/utils/mixins/export.py (outdated):

    if chunk_size is None:
        # estimate appropriate chunk size
        chunk_size = max(
            int(2048 // (len(restr_str) / len(restricted_entries))), 1
What kinds of speed gains would we see from increasing the varchar? I would imagine the merging later takes more time than fetching the longer varchar.
Empirical question, likely not worth the deep dive required to answer
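For reference, the chunk-size estimate in the hunk above can be sketched as a standalone function (names are illustrative):

```python
def estimate_chunk_size(restr_str: str, n_entries: int, limit: int = 2048) -> int:
    """Estimate how many entry keys fit under the varchar limit.

    Assumes each entry contributes roughly len(restr_str) / n_entries
    characters to the compiled restriction string, and floors at 1 so an
    oversized single entry still forms its own chunk.
    """
    avg_chars_per_entry = len(restr_str) / n_entries
    return max(int(limit // avg_chars_per_entry), 1)
```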
src/spyglass/utils/mixins/export.py (outdated):

        i * chunk_size : (i + 1) * chunk_size
    ]
    chunk_restr_str = make_condition(self, chunk_entries, set())
    self._insert_log(chunk_restr_str)
There's probably a way to make this recursive to avoid ever hitting the varchar limit:

    def log_fetch(self, restr, ...):
        if len(restr) < self._export_restr_limit:
            self._insert_log(restr)
            return
        all_entries = (self & restr).fetch("KEY", log_export=False)
        for chunk_entries in all_entries[ ... chunk slicing ... ]:
            _ = (self & chunk_entries).fetch("KEY", log_export=True)

This way, if chunking fails for whatever reason, it'll get re-chunked. The embedded case will be smaller than we want, probably slower, but that seems preferable to hitting this error.
Reasonable, I'll try something like that. It's also still technically possible to hit the limit if the primary key restriction for a single entry is >2048.
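A self-contained sketch of that recursive re-chunking, with `render` standing in for datajoint's make_condition and `insert_log` for the export log insert (both assumptions, not the PR's exact API):

```python
def log_restriction_chunked(entries, insert_log, render=str, limit=2048):
    """Recursively halve a list of entry keys until each rendered
    restriction fits under `limit`, then log each piece.

    A single entry whose rendered form exceeds the limit is logged
    as-is: that is the >2048-character caveat noted above.
    """
    restr = render(entries)
    if len(restr) <= limit or len(entries) <= 1:
        insert_log(restr)
        return
    mid = len(entries) // 2
    log_restriction_chunked(entries[:mid], insert_log, render, limit)
    log_restriction_chunked(entries[mid:], insert_log, render, limit)
```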
CBroz1 left a comment:
Thanks for this work @samuelbray32, especially the speed gains. Had some questions around motivation, and a desire to avoid globals where we can.
    ), "table_to_undo must be a projection of table"

    anti_alias_dict = {
        attr.attribute_expression.strip("`"): attr.name
There's a property for this here, such that you could replace

    attr.attribute_expression.strip("`"): attr.name

with

    attr.original_name: attr.name

or just run this dict for all attrs and run the project for a full attr map: {'a': 'a', 'orig': 'name', 'c': 'c'}
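A sketch of that full-attr-map idea; attribute objects are assumed to expose `.original_name` and `.name` as in the suggestion above, mocked here with SimpleNamespace:

```python
from types import SimpleNamespace


def full_attr_map(attrs):
    """Build a projection map over every attribute, mapping each
    attribute's original name to its (possibly aliased) current name,
    e.g. {'a': 'a', 'orig': 'name', 'c': 'c'}.
    """
    return {attr.original_name: attr.name for attr in attrs}
```

Passing the full map to a project call would then cover aliased and unaliased columns in one shot.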
    if not (
        isinstance(restr_str, str)
        and (
            (len(restr_str) > 2048)
Keep 2048 hard coded?
    "Export cannot handle subquery restrictions. Please submit a "
    + "bug report on GitHub with the code you ran and this "
    + f"restriction:\n\t{restr_str}"

    "Single entry restriction exceeds 2048 characters.\n\t"
Maybe adjust to say 'cannot store' so that a user has more context if they hit it.
    )
    _ = trodes_pos_v1 * (
        common.IntervalList & "interval_list_name = 'pos 0 valid times'"
    )  # Note for PR: table and restriction change because join no longer logs empty results
👌🏼
    (pos_merge & merge_key).fetch_nwb()

    ExportSelection.start_export(paper_id=1, analysis_id=5)
    trodes_pos_v1._export_cache.clear()  # Clear cache to ensure proj table is captured
Is this something start_export should do?
CBroz1 left a comment:
Work still ongoing?
    dandi_api_key: Optional[str] = None,
    dandi_instance: Optional[str] = "dandi",
    skip_raw_files: Optional[bool] = False,
    n_compile_processes: Optional[int] = 1,
I feel like it's safe to assume a user will want n_processes to match across these cases, no? No harm in parsing them out, but I think tighter signatures are more likely to be leveraged
        shutil.copy(file, f"{destination_dir}/{os.path.basename(file)}")
    else:
        os.symlink(file, f"{destination_dir}/{os.path.basename(file)}")
    # for file in source_files:
Can remove?
    def _make_file_in_dandi_dir(file, destination_dir, skip_raw_files):
        if os.path.exists(f"{destination_dir}/{os.path.basename(file)}"):
            return
        if skip_raw_files and raw_dir in file:
Here, raw_dir refers to the path fetched from spyglass.settings, yeah? Not now, but I think it's worth adopting a caps convention for those to make it clear they're constants, and not missing func args
To confirm, you're saying use this for import:

    from spyglass.settings import raw_dir as RAW_DIR

and propagate changes accordingly?
I'm saying that a future PR should edit settings to raw_dir -> RAW_DIR for the whole codebase. This is fine for now
    @@ -0,0 +1,247 @@
    from __future__ import annotations
Worth adding to h5_helper_fn? Might be confusing to import from both. Or, can rename the existing one to reflect recompute goals
    hits: List[str] = []

    def _visit(name: str, obj) -> None:
        if not isinstance(obj, h5py.Dataset):
Some of the current h5 code is doing this 'if is type, visit' logic, though embedded in the Comparison class
src/spyglass/utils/h5py_helper_fn.py (outdated):

    def convert_dataset_type(file: h5py.File, dataset_path: str, target_dtype: str):
        """Convert a dataset to a different dtype 'in place' (-ish).
Please add note on the 'ish' - what makes it not totally in place?
Thanks. The NWB file remains in place, but the hdmf dataset inside it is replaced with a new one (with attributes set to match). There may be a signature left at the hdmf level. I'll clarify in the docstring.
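A minimal sketch of that delete-and-recreate pattern in plain h5py (an assumed standalone helper, not the PR's exact implementation): the file stays put, but the dataset is rebuilt with the target dtype and its attributes copied back.

```python
import h5py


def convert_dataset_dtype(path: str, dataset_path: str, target_dtype: str) -> None:
    """Replace a dataset with a copy of itself cast to target_dtype.

    'In place (-ish)': the file is modified in place, but the dataset
    object is deleted and recreated rather than rewritten.
    """
    with h5py.File(path, "r+") as f:
        dset = f[dataset_path]
        data = dset[...].astype(target_dtype)  # cast the full array
        attrs = dict(dset.attrs)               # preserve attributes
        del f[dataset_path]                    # remove the old dataset
        new = f.create_dataset(dataset_path, data=data, dtype=target_dtype)
        for k, v in attrs.items():             # restore attributes
            new.attrs[k] = v
```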
src/spyglass/utils/h5py_helper_fn.py (outdated):

            new_dset.attrs[k] = v

    def find_dynamic_tables_missing_id(nwb_path: str | Path) -> List[str]:
Does pipe operator type hinting work if a user is still on 3.9? I'd like to migrate to this, but I think we need to wait until we require 3.10
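One data point on that question: with `from __future__ import annotations` (which this file already imports at the top), annotations are stored as strings and never evaluated, so a `str | Path` hint parses fine on 3.9 for the definition itself; only runtime evaluation (e.g. `typing.get_type_hints`, or `isinstance(x, str | Path)`) needs 3.10. A minimal sketch:

```python
from __future__ import annotations  # PEP 563: annotations stay as strings

from pathlib import Path
from typing import List


def find_tables_sketch(nwb_path: str | Path) -> List[str]:
    # the union annotation is never evaluated, so this definition
    # imports cleanly even on Python 3.9
    return [str(nwb_path)]
```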
Export Improvements

Running the export pipeline for publication, I came across two common barriers not previously handled:

- restriction strings exceeding the 2048-character limit of ExportSelection.Table
- subquery restrictions (e.g. Table1 & (Table2 & key))

This PR solves both with the following. In _log_fetch, if either case applies:

- fetch the entry keys of the restricted table
- chunk into groups of entries that make a restriction shorter than the 2048 limit
- make an entry in ExportSelection.Table for each of these chunks

I think this minimizes the adjustments users need to make to their code for exporting, while preserving the query ability in the ExportSelection entries.
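The fetch-chunk-log steps can be sketched as a greedy packer; `render` stands in for datajoint's make_condition (an assumption, not the PR's exact code):

```python
def chunked_restrictions(entries, render, limit=2048):
    """Greedily pack entry keys into groups whose rendered restriction
    stays under the varchar limit, returning one restriction string per
    chunk. A single entry that exceeds the limit still forms its own
    chunk (the known caveat).
    """
    chunks, current = [], []
    for entry in entries:
        candidate = current + [entry]
        if current and len(render(candidate)) > limit:
            chunks.append(render(current))  # flush the full chunk
            current = [entry]
        else:
            current = candidate
    if current:
        chunks.append(render(current))
    return chunks
```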
ExportSelectionentriesOther
_log_fetch_nwb#1390_log_fetchcall:nwb_file_name's #1393restrGraph(callable viagraph1 & graph2)nwb_files_includedNwbfileand limited to these filesExportSelectionto get the vertical database slice only dependent on the selected nwb filesExportSelection.Tableentries for a given table prior to creatingrestrGraphDandi Export Improvements
update_analysis_for_dandi_standardto check for missingsource_script_file_pathand add rdefault valuefloat 16#1403float-16data types and remake the hdmf dataset object asfloat32In progress
Checklist:

- [ ] CITATION.cff / alter snippet for release notes
- [ ] CHANGELOG.md with PR number and description